This is the Data Visualization Project from the Udemy course Data Science and Machine Learning Bootcamp, taught by Jose Portilla.
The source of the data is the Economist.
I am not following exactly the requirements set out by the course. The reason is that, in addition to the data visualization aspects of the project, I also want to practise using hypothesis testing.
Key Points
- HDI is the Human Development Index, created by the United Nations. The HDI “is a summary measure of average achievement in key dimensions of human development: a long and healthy life, being knowledgeable and have a decent standard of living”.United Nations Development Programme, Human Development Reports. The HDI is measured on a scale of 0-1, with 0 being the lowest score (i.e. people in that country have a very low quality of life) and 1 (i.e. people have a very low quality of life).
- CPI is the Corruption Perceptions Index, developed by Transparency International. The CPI “ranks 180 countries and territories by their perceived levels of public sector corruption, according to experts and business people.”Transparency International. CPI is measured on a scale of 0-100, with 0 being very corrupt and 100 being very clean.
Hypothesis
My starting hypothesis is that HDI is positively correlated with CPI. That is: * a country with a low HDI will have a low CPI * a country with a medium HDI will have a medium CPI * a country with a high HDI will have a high CPI. H1 = correlation(HDI,CPI) > 0
This means that the null hypothesis can be stated as:
HDI is negatively correlated with CPI, or there is no correlation H0 = correlation(HDI,CPI) <= 0
So, the purposes of my project are: 1. To test whether the null hypothesis can be rejected 2. To visualize the data using ggplot2
Import the ggplot2 data.table libraries and use fread to load the csv file ‘Economist_Assignment_Data.csv’ into a dataframe called df (Hint: use drop=1 to skip the first column)
library(ggplot2)
Registered S3 method overwritten by 'dplyr':
method from
print.rowwise_df
library(ggthemes)
library(readr)
library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
df <- read.csv('./Training Exercises/Capstone and Data Viz Projects/Data Visualization Project/Economist_Assignment_Data.csv')
str(df)
'data.frame': 173 obs. of 6 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ Country : Factor w/ 173 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
$ HDI.Rank: int 172 70 96 148 45 86 2 19 91 53 ...
$ HDI : num 0.398 0.739 0.698 0.486 0.797 0.716 0.929 0.885 0.7 0.771 ...
$ CPI : num 1.5 3.1 2.9 2 3 2.6 8.8 7.8 2.4 7.3 ...
$ Region : Factor w/ 6 levels "Americas","Asia Pacific",..: 2 3 5 6 1 3 2 4 3 1 ...
colnames(df)
[1] "X" "Country" "HDI.Rank" "HDI" "CPI" "Region"
nrow(df)
[1] 173
## Check the head of df
head(df)
## Use ggplot() + geom_point() to create a scatter plot object called pl. You will need to specify x=CPI and y=HDI and color=Region as aesthetics
#pl <- ggplot(df, aes(x=CPI,y=HDI)) + geom_point(aes(color=Region))
## Change the points to be larger empty circles. (You'll have to go back and add arguments to geom_point() and reassign it to pl.) You'll need to figure out what shape= and size=
pl <- ggplot(df, aes(x=CPI,y=HDI)) + geom_point(aes(color=Region), shape = 1, size=3 )
pl

Add geom_smooth(aes(group=1)) to add a trend line
pl + geom_smooth(aes(group=1))

We want to further edit this trend line. Add the following arguments to geom_smooth (outside of aes):
method = ‘lm’ formula = y ~ log(x) se = FALSE color = ‘red’ For more info on these arguments, check out the documentation under the Arguments list for details.
Assign all of this to pl2
pl2 <- pl + geom_smooth(aes(group=1), method = 'lm', formula = y ~ log(x), se=FALSE, color='red')
pl2

It’s really starting to look similar! But we still need to add labels, we can use geom_text! Add geom_text(aes(label=Country)) to pl2 and see what happens. (Hint: It should be way too many labels)
pl2 + geom_text(aes(label=Country))

Labeling a subset is actually pretty tricky! So we’re just going to give you the answer since it would require manually selecting the subset of countries we want to label!
pointsToLabel <- c("Russia", "Venezuela", "Iraq", "Myanmar", "Sudan", "Afghanistan", "Congo", "Greece", "Argentina", "Brazil", "India", "Italy", "China", "South Africa", "Spain", "Botswana", "Cape Verde", "Bhutan", "Rwanda", "France", "United States", "Germany", "Britain", "Barbados", "Norway", "Japan", "New Zealand", "Singapore")
pl3 <- pl2 + geom_text(aes(label = Country), color = "gray20", fontface='bold',
data = subset(df, Country %in% pointsToLabel),check_overlap = TRUE)
pl3

## Almost there! Still not perfect, but good enough for this assignment. Later on we’ll see why interactive plots are better for labeling. Now let’s just add some labels and a theme, set the x and y scales and we’re done!
Add theme_bw() to your plot and save this to pl4
pl4 <- pl3 + theme_bw()
pl4

Add scale_x_continuous() and set the following arguments:
name = Same x axis as the Economist Plot limits = Pass a vector of appropriate x limits breaks = 1:10
pl4 <- pl4 + scale_x_continuous(name='Corruption Perceptions Index, 2011(10=least corrupt)', limits = c(0.9,10.5), breaks = 1:10)
pl4

Now use scale_y_continuous to do similar operations to the y axis!
pl4 <- pl4 + scale_y_continuous(name='Human Development Index, 2011 (1=Best)', limits = c(0.2,1.0))
pl4

Finally use ggtitle() to add a string as a title.
pl4 <- pl4 + ggtitle('Corruption and Human Development') + theme_economist()
library(plotly)
Registered S3 method overwritten by 'data.table':
method from
print.data.table
Registered S3 methods overwritten by 'htmltools':
method from
print.html tools:rstudio
print.shiny.tag tools:rstudio
print.shiny.tag.list tools:rstudio
Registered S3 method overwritten by 'htmlwidgets':
method from
print.htmlwidget tools:rstudio
Attaching package: ‘plotly’
The following object is masked from ‘package:ggplot2’:
last_plot
The following object is masked from ‘package:stats’:
filter
The following object is masked from ‘package:graphics’:
layout
gpl4 <- ggplotly(pl4)
gpl4
Results
A look at the graph shows a fairly clear positive correlation between HDI and CPI. I would estimate that it’s somewhere between 0.5 and 1. Therefore, it looks like we can reject the null hypothesis.
Hang on! What about significance? Is it possible the positive correlation could be due to chance, and that the difference is not statistically significant?
# calculate the correlation coefficient (because I haven't specified a method, this will use the default Pearson Correlation Coefficient)
correlationCoefficient <- cor(df$HDI, df$CPI)
library(stringr)
str_glue("Correlation Coefficient: {correlationCoefficient}")
Correlation Coefficient: 0.704870476914723
Use the cor.test() function to test the hypothesis. This will find the p-value.
correlationTest <- cor.test(df$HDI, df$CPI)
# class(correlationTest)
correlationTest
Pearson's product-moment correlation
data: df$HDI and df$CPI
t = 12.994, df = 171, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.6209764 0.7727980
sample estimates:
cor
0.7048705
p_value <- format(2.2e-16, scientific = FALSE)
str_glue("P-value: {p_value}")
P-value: 0.00000000000000022
Conclusion: we can see that the p-value is 0.00000000000000022, which is less than 0.05 (5%), which is the significance level I am using. The p-value is the probability that we would have found the current result if the correlation coefficient was zero. That is, there would be a 5% possibility of getting these results, by chance alone. Because the p-value is less than 0.05, we can reject that possibility. Therefore, we can reject the null hypothesis.
---
title: "Data Visualization Project"
output: html_notebook
---
 
This is the Data Visualization Project from the Udemy course [Data Science and Machine Learning Bootcamp](https://www.udemy.com/course/data-science-and-machine-learning-bootcamp-with-r/learn/lecture/5397722#content), taught by Jose Portilla.

The source of the data is the [Economist](https://www.economist.com/graphic-detail/2011/12/02/corrosive-corruption).

I am not following exactly the requirements set out by the course. The reason is that, in addition to the data visualization aspects of the project, I also want to practise using hypothesis testing.

## Key Points
* HDI is the Human Development Index, created by the United Nations. The HDI "is a summary measure of average achievement in key dimensions of human development: a long and healthy life, being knowledgeable and have a decent standard of living".[United Nations Development Programme, Human Development Reports](http://hdr.undp.org/en/content/human-development-index-hdi). The HDI is measured on a scale of 0-1, with 0 being the lowest score (i.e. people in that country have a very low quality of life) and 1 (i.e. people have a very low quality of life).
* CPI is the Corruption Perceptions Index, developed by Transparency International. The CPI "ranks 180 countries and territories by their perceived levels of public sector corruption, according to experts and business people."[Transparency International](https://www.transparency.org/cpi2019). CPI is measured on a scale of 0-100, with 0 being very corrupt and 100 being very clean.

## Hypothesis
My starting hypothesis is that HDI is positively correlated with CPI. That is:
* a country with a low HDI will have a low CPI
* a country with a medium HDI will have a medium CPI 
* a country with a high HDI will have a high CPI.
  _H1 = correlation(HDI,CPI) > 0_

This means that the null hypothesis can be stated as:

HDI is negatively correlated with CPI, or there is no correlation
  _H0 = correlation(HDI,CPI) <= 0_

So, the purposes of my project are:
1. To test whether the null hypothesis can be rejected
2. To visualize the data using `ggplot2` 

## Import the ggplot2 data.table libraries and use fread to load the csv file 'Economist_Assignment_Data.csv' into a dataframe called df (Hint: use drop=1 to skip the first column)

```{r}
library(ggplot2)
library(ggthemes)
library(readr)
library(dplyr)
df <- read.csv('./Training Exercises/Capstone and Data Viz Projects/Data Visualization Project/Economist_Assignment_Data.csv')
str(df)
colnames(df)
 
nrow(df)

## Check the head of df
head(df)
```

 ## Use ggplot() + geom_point() to create a scatter plot object called pl. You will need to specify x=CPI and y=HDI and color=Region as aesthetics
```{r}
#pl <- ggplot(df, aes(x=CPI,y=HDI)) + geom_point(aes(color=Region))

## Change the points to be larger empty circles. (You'll have to go back and add arguments to geom_point() and reassign it to pl.) You'll need to figure out what shape= and size=
pl <- ggplot(df, aes(x=CPI,y=HDI)) + geom_point(aes(color=Region), shape = 1, size=3 )


pl
```
 
## Add geom_smooth(aes(group=1)) to add a trend line
```{r}
pl + geom_smooth(aes(group=1))
```

## We want to further edit this trend line. Add the following arguments to geom_smooth (outside of aes):

method = 'lm'
formula = y ~ log(x)
se = FALSE
color = 'red'
For more info on these arguments, check out the documentation under the Arguments list for details.

Assign all of this to pl2
```{r}
pl2 <- pl + geom_smooth(aes(group=1), method = 'lm', formula = y ~ log(x), se=FALSE, color='red')

pl2
```

## It's really starting to look similar! But we still need to add labels, we can use geom_text! Add geom_text(aes(label=Country)) to pl2 and see what happens. (Hint: It should be way too many labels)
```{r}
pl2 + geom_text(aes(label=Country))

```

## Labeling a subset is actually pretty tricky! So we're just going to give you the answer since it would require manually selecting the subset of countries we want to label!
```{r}
pointsToLabel <- c("Russia", "Venezuela", "Iraq", "Myanmar", "Sudan", "Afghanistan", "Congo", "Greece", "Argentina", "Brazil", "India", "Italy", "China", "South Africa", "Spain", "Botswana", "Cape Verde", "Bhutan", "Rwanda", "France", "United States", "Germany", "Britain", "Barbados", "Norway", "Japan", "New Zealand", "Singapore")

pl3 <- pl2 + geom_text(aes(label = Country), color = "gray20", fontface='bold',
                data = subset(df, Country %in% pointsToLabel),check_overlap = TRUE)

pl3
```

## Almost there! Still not perfect, but good enough for this assignment. Later on we'll see why interactive plots are better for labeling. Now let's just add some labels and a theme, set the x and y scales and we're done!

Add theme_bw() to your plot and save this to pl4
```{r}
pl4 <- pl3 + theme_bw()
pl4
```

## Add scale_x_continuous() and set the following arguments:

name = Same x axis as the Economist Plot
limits = Pass a vector of appropriate x limits
breaks = 1:10
```{r}
pl4 <- pl4 + scale_x_continuous(name='Corruption Perceptions Index, 2011(10=least corrupt)', limits = c(0.9,10.5), breaks = 1:10)
pl4
```

## Now use scale_y_continuous to do similar operations to the y axis!
```{r}
pl4 <- pl4 + scale_y_continuous(name='Human Development Index, 2011 (1=Best)', limits = c(0.2,1.0))
pl4
```

## Finally use ggtitle() to add a string as a title.
```{r}
pl4 <- pl4 + ggtitle('Corruption and Human Development') + theme_economist()

library(plotly)
gpl4 <- ggplotly(pl4)
gpl4
```

## Results
A look at the graph shows a fairly clear positive correlation between HDI and CPI. I would estimate that it's somewhere between 0.5 and 1. Therefore, it looks like we can reject the null hypothesis.

Hang on! What about significance? Is it possible the positive correlation could be due to chance, and that the difference is not statistically significant?

```{r}

# calculate the correlation coefficient (because I haven't specified a method, this will use the default Pearson Correlation Coefficient)
correlationCoefficient <- cor(df$HDI, df$CPI)
library(stringr)
str_glue("Correlation Coefficient: {correlationCoefficient}")
```
## Use the cor.test() function to test the hypothesis. This will find the p-value.
```{r}
correlationTest <- cor.test(df$HDI, df$CPI)
# class(correlationTest)
correlationTest
p_value <- format(2.2e-16, scientific = FALSE)
str_glue("P-value: {p_value}")
```

Conclusion: we can see that the p-value is 0.00000000000000022, which is less than 0.05 (5%), which is the significance level I am using. The p-value is the probability that we would have found the current result if the correlation coefficient was zero. That is, there would be a 5% possibility of getting these results, by chance alone. Because the p-value is less than 0.05, we can reject that possibility. Therefore, we can reject the null hypothesis. 



